New resources for recognition of confusable linguistic varieties: the LRE11 corpus

Authors

  • Stephanie Strassel
  • Kevin Walker
  • Karen Jones
  • David Graff
  • Christopher Cieri
Abstract

The NIST 2011 Language Recognition Evaluation focuses on language pair discrimination for 24 languages/dialects, some of which may be considered mutually intelligible or closely related. The LRE11 evaluation required new data for all languages, comprising both conversational telephone speech and broadcast narrowband speech from multiple sources in each language. Given the potential confusion among varieties in the collection, manual language auditing required special care, including the assessment of inter-auditor consistency. We report on collection methods, auditing approaches, and results.

1. Data Requirements

The NIST Language Recognition Evaluation (LRE) campaigns began in 1996 with the goal of evaluating performance on language recognition in narrowband speech. The most recent campaign, LRE11, targets language pair discrimination for 24 languages/dialects, some of which may be mutually intelligible to some extent by humans [1]. Data requirements for LRE11 demanded collection of speech sufficient to yield at least 400 narrowband segments for each language.

Traditionally, LRE evaluations have utilized large collections of conversational telephone speech (CTS). The 2009 LRE corpus represented the first departure from the standard approach in its reliance on narrowband segments embedded in broadcast, typically coming from listener call-ins, phone interviews of pundits, and some correspondent reports and man-on-the-street interviews.

LRE11 targets collection of both CTS and broadcast narrowband speech (BNBS) for each language, with a few exceptions. Modern Standard Arabic (ara) is a formal variety that would not typically be spoken during spontaneous conversation and was excluded as a CTS collection target. Conversely, the dialectal Arabic varieties of Iraqi, Levantine and Maghrebi were not expected to appear in formal broadcast news programs and were therefore excluded as BNBS targets.
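The collection targets described above can be pictured as a small lookup table. The following is only an illustrative sketch, not the authors' tooling: the function names and the dialect labels other than "ara" are hypothetical, and counting the 400-segment minimum as a total across both channels is an assumption, since the text states the requirement only per language.

```python
# Minimum number of narrowband segments per language, as stated in the text.
MIN_SEGMENTS = 400

# The two channel types targeted by the collection.
CHANNELS = ("CTS", "BNBS")

# Channel exclusions described in the text.
# Only "ara" is a code from the source; the other keys are hypothetical labels.
EXCLUDED = {
    "ara": {"CTS"},                # Modern Standard Arabic: not collected as CTS
    "arabic_iraqi": {"BNBS"},      # dialectal varieties: not collected as BNBS
    "arabic_levantine": {"BNBS"},
    "arabic_maghrebi": {"BNBS"},
}


def target_channels(language):
    """Return the channels that are collection targets for a language."""
    excluded = EXCLUDED.get(language, set())
    return [channel for channel in CHANNELS if channel not in excluded]


def meets_requirement(segment_counts):
    """Check the per-language minimum.

    segment_counts maps channel name -> number of audited segments.
    Summing across channels is an assumption made for this sketch.
    """
    return sum(segment_counts.values()) >= MIN_SEGMENTS
```

For example, `target_channels("ara")` yields only the BNBS channel, while a language with no listed exclusion is targeted in both channels.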
Collection also targeted multiple broadcast sources, where “source” is a provider-program (so Larry King Live is different from CNN Headline News). To satisfy the need for data in languages that might exhibit a high degree of confusability (whether for humans or systems), we reviewed sources including Ethnologue [2] and compiled a preliminary list of candidate languages. Each language was assigned a confusability index score:

  • 0 Not likely to be confusable with another candidate

1 Throughout the paper we use language as shorthand for a linguistic variety that may be referred to by different sources as a language or dialect.

Similar resources

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

The 2011 NIST Language Recognition Evaluation

In 2011, NIST held the most recent in an ongoing series of Language Recognition Evaluations originating in 1996. The 2011 NIST Language Recognition Evaluation (LRE11) featured 24 languages, including nine languages new to the LRE series, from two different source types, and had participation from 23 research organizations. LRE11 utilized a new evaluation metric, which focused on difficult to di...

Multi-language Speech Collection for NIST LRE

The Multi-language Speech (MLS) Corpus supports NIST’s Language Recognition Evaluation series by providing new conversational telephone speech and broadcast narrowband data in 20 languages/dialects. The corpus was built with the intention of testing system performance in distinguishing closely related or confusable linguistic varieties, and careful manual auditing of collected dat...

A new method for extracting named entities in Classical Arabic

In Natural Language Processing (NLP) studies, developing resources and tools contributes to the extension and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted the attention of NLP researchers because of its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...

Online Persian handwriting recognition using a language model and reduction of user writing rules

The joined-up, cursive form of Persian words, the immense variety of its scripts, and the different shapes Persian letters take depending on their position in a word have made Persian handwriting recognition an intense challenge. The major obstacle of the most common recognition methods is their inattention to sentence context, which causes the use of a word with correct appea...

Publication date: 2012